Abstract

This paper explores the problem of efficiently ordering interprocessor communication
operations in statically-scheduled multiprocessors for iterative dataflow graphs. In most digital
signal processing applications, the throughput of the system is significantly affected by
communication costs. By explicitly modeling these costs within an effective graph-theoretic analysis
framework, we show that ordered transaction schedules can significantly outperform self-timed
schedules even when synchronization costs are low. However, we also show that when
communication latencies are non-negligible, finding an optimal transaction order given a static
schedule is an NP-complete problem, and that this intractability holds under both iterative and
non-iterative execution. We develop new heuristics for finding efficient transaction orders, and
perform an experimental comparison to gauge the performance of these heuristics.

1.  Background 

This paper explores the problem of efficiently ordering interprocessor communication
(IPC) operations in statically-scheduled multiprocessors for iterative dataflow specifications. An
iterative dataflow specification consists of a dataflow representation of the body of a loop that is
to be iterated indefinitely. Dataflow programming in this form is used widely in the design and
implementation of digital signal processing (DSP) systems.

In this paper, we assume that we are given a dataflow specification of an application, and
an associated multiprocessor schedule (e.g., derived from scheduling techniques such as those
presented in [6, 9, 18, 22]). Our objective is to reduce the overall IPC cost of the multiprocessor
implementation, and the associated performance degradation, since IPC operations result in
significant execution time and power consumption penalties, and are difficult to optimize thoroughly
during the scheduling stage. IPC is assumed to take place through shared memory, which could be
global memory between all processors, or could be distributed between pairs of processors (e.g.,
hardware first-in-first-out queues or dual-ported memory). Such simple communication
mechanisms, as opposed to crossbars and elaborate interconnection networks, are common in
embedded systems, due to their simplicity and low cost.

1.1 Scheduling dataflow graphs

Our study of multiprocessor implementation strategies in this paper is in the context of
homogeneous synchronous dataflow (HSDF) specifications. In HSDF, an application is
represented as a directed graph in which vertices (actors) represent computational tasks of arbitrary
complexity; edges (arcs) specify data dependencies; and the number of data values (tokens)
produced and consumed by each actor is fixed. An actor executes (fires) when it has enough tokens
on its input arcs, and during execution, it produces tokens on its output arcs. HSDF imposes the
restriction that on each invocation, each actor consumes exactly one token from each input arc,
and produces one token on each output arc. HSDF and closely-related models are used
extensively for multiprocessor implementation of embedded signal processing systems (e.g., see
[6, 10, 11, 12]). We refer to an HSDF representation of an application as an application graph.
For multiprocessor implementation of dataflow graphs, actors in the graph need to be
scheduled. Scheduling can be divided into three steps [13]: assigning actors to processors
(processor assignment), ordering the actors assigned to each processor (actor ordering), and
determining when each actor should commence execution. All of these tasks can be performed either
at run-time or at compile time, which gives us different scheduling strategies. To reduce run-time
overhead and improve predictability, it is often desirable in embedded applications to carry out as
many of these steps as possible at compile time [13].

Typically,  there  is  limited  information  available  at  compile  time  since  the  execution  times 
of  the  actors  are  often  estimated  values.  These  may  be  different  from  the  actual  execution  times 
due  to  actors  that  display  run-time  variation  in  their  execution  times  because  of  conditionals  or 
data-dependent  loops  within  them,  for  example.  However,  in  a  number  of  important  embedded 
domains,  such  as  DSP,  it  is  widely  accepted  that  execution  time  estimates  are  reasonably  accurate, 
and  that  good  compile-time  decisions  can  be  based  on  them.  In  this  paper,  we  focus  on  scheduling 
methods  that  extensively  make  use  of  execution  time  estimates,  and  perform  the  first  two  steps  — 
processor  assignment  and  actor  ordering  —  at  compile  time. 

In relation to the scheduling taxonomy of Lee and Ha [13], there are three general
strategies with which we are primarily concerned in this paper. In the fully-static (FS) strategy, all
three scheduling steps are carried out at compile time, including the determination of an exact
firing time for each actor. In the self-timed (ST) strategy, on the other hand, processor assignment
and actor ordering are performed at compile time, but run-time synchronization is used to determine
actor firing times: an ST schedule executes by firing each actor invocation A as soon as it can be
determined via synchronization that the actor invocations on which A is dependent have all
completed execution.

The FS and ST methods represent two extremes in the class of scheduling algorithms
considered in this paper. The ST method is the least constrained scheme since the only constraints
are the IPC dependencies, and it is tolerant of variations in execution times, while the FS strategy
only works when tight worst-case execution times are available, and forces system performance to
conform to the available worst-case bounds. When we ignore IPC costs, the ST schedule
consequently gives us a lower bound on the average iteration period of the schedule since it
executes in an ASAP (as soon as possible) manner.

The ordered transaction (OT) method [11, 23] falls in between these two strategies. It is
similar to the ST method but also adds the constraint that a linear ordering of the communication
actors is determined at compile time, and enforced at run-time. The linear ordering imposed is
called the transaction order of the associated multiprocessor implementation.

The FS and OT strategies have significantly lower overall IPC cost since all of the
sequencing decisions associated with communication are made at compile time. The ST method,
on the other hand, incurs higher IPC cost since it requires synchronization checks to guarantee
the fidelity of each communication operation, that is, to guarantee that buffer underflow and
overflow are consistently avoided. Significant compile-time analysis can be performed to
streamline this synchronization functionality [3, 4].

The metric of interest to us in this paper is the average iteration period $T$. Intuitively, in
an iterative execution of a dataflow graph, the iteration period is the number of cycles that it takes
for each of the actors in the graph to execute exactly once, i.e., to complete a single graph
iteration. Note that it is not necessary in a self-timed schedule for the iteration period to be the same
from one graph iteration to the next, even when actor execution times are fixed [24]. The inverse
of the average iteration period $T$ gives us the throughput $T^{-1}$, which is the average number of
graph iterations carried out per unit time.


1.2  Terminology  and  notation 

We denote the set of positive integers by $\mathbb{Z}^+$, the set of natural numbers
$\{0, 1, 2, \ldots\}$ by $\mathbb{N}$, and the number of elements in a finite set $S$ by $|S|$.

With each actor $v \in V$ in an HSDF specification $(V, E)$, we associate an integer $exec(v)$,
which denotes the execution time estimate of $v$, and an integer $proc(v)$, which denotes the
processor that $v$ is assigned to in the assignment step. Each edge $(v_i, v_j) \in E$ has a non-negative
integer delay associated with it, which is denoted by $delay(v_i, v_j)$. These delays represent initial
tokens, and specify dependencies between iterations of actors in iterative execution. For example,
if the tokens produced by an actor $v_i$ on its $k$th invocation are consumed by actor $v_j$ on its
$(k + 2)$th invocation, the edge between $v_i$ and $v_j$ would have a delay of 2.

Every edge $(v_i, v_j)$ induces the precedence constraint

$$start(v_j, k) \geq start(v_i, k - delay(v_i, v_j)) + t(v_i), \qquad (1)$$

where $start(x, k) \in \mathbb{Z}^+$ denotes the starting time of the $k$th invocation of an actor $x$,
and $t(v_i)$ is the execution time of $v_i$. Here, $start(x, k)$ is set to 0 for $k < 0$ as initial
conditions.

A path in a directed graph $(V, E)$ is a finite sequence $(e_1, e_2, \ldots, e_n)$, where each $e_i$ is in
$E$, and $snk(e_i) = src(e_{i+1})$, for $i = 1, 2, \ldots, (n - 1)$. We say that the path
$(e_1, e_2, \ldots, e_n)$ is directed from $src(e_1)$ to $snk(e_n)$. A path that is directed from some
vertex to itself is called a cycle. Given a path $p = (e_1, e_2, \ldots, e_n)$, the path delay of $p$,
denoted $Delay(p)$, is given by

$$Delay(p) = \sum_{i=1}^{n} delay(e_i). \qquad (2)$$

Each cycle $c$ in a dataflow graph must satisfy $Delay(c) > 0$ to avoid deadlock.
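
The path delay in (2) and the deadlock condition on cycles can be checked mechanically. The
following Python sketch is illustrative only: it assumes that the graph is given as a dictionary
mapping each edge (src, snk) to its delay, and the names path_delay and has_zero_delay_cycle are
hypothetical. Because delays are non-negative, a cycle has zero total delay exactly when all of its
edges are delayless, so the check reduces to cycle detection on the delayless edges.

```python
from collections import defaultdict


def path_delay(path, delay):
    """Sum of edge delays along a path given as a list of (src, snk) edges."""
    return sum(delay[e] for e in path)


def has_zero_delay_cycle(vertices, delay):
    """Return True if some cycle has total delay 0, i.e. the graph deadlocks.

    Only delayless edges are followed; a back edge found by the depth-first
    search then closes a cycle whose delay is 0.
    """
    succ = defaultdict(list)
    for (u, v), d in delay.items():
        if d == 0:
            succ[u].append(v)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {v: WHITE for v in vertices}

    def visit(u):
        color[u] = GRAY
        for w in succ[u]:
            if color[w] == GRAY or (color[w] == WHITE and visit(w)):
                return True
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in vertices)


# Hypothetical three-actor cycle with two initial tokens on the edge (c, a).
example_delay = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
example_cycle = [("a", "b"), ("b", "c"), ("c", "a")]
print(path_delay(example_cycle, example_delay))
print(has_zero_delay_cycle(["a", "b", "c"], example_delay))
```

The recursive search is adequate for graphs of modest size; an iterative traversal would be a
natural substitute for very large graphs.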

The evolution of a self-timed implementation can be modeled by Sriram's IPC graph
model [24]. Given an application graph and an associated self-timed schedule, the IPC graph,
denoted $G_{ipc}$, is constructed by instantiating a vertex for each application graph actor, connecting
an edge from each actor to the actor that succeeds it on the same processor, and adding an edge
that has unit delay from the last actor on each processor to the first actor on the same processor.
Also, for each application graph edge $(x, y)$ that connects actors that execute on different
processors, an inter-processor edge is instantiated in $G_{ipc}$ from $x$ to $y$. A sample application graph
and a self-timed schedule are illustrated in Figure 1, and the corresponding IPC graph is
illustrated in Figure 2.
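
The construction described above can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the authors' implementation: the application graph is
assumed to be a dictionary from edges (x, y) to delays, the schedule a dictionary from processor
identifiers to ordered actor lists, and the function name build_ipc_graph and the small example
graph are hypothetical.

```python
def build_ipc_graph(app_edges, schedule):
    """Return the IPC graph as a dict mapping (src, snk) -> delay.

    app_edges: dict (x, y) -> delay for the application graph.
    schedule:  dict processor -> list of actors in execution order.
    """
    proc = {a: p for p, actors in schedule.items() for a in actors}
    ipc = {}
    for actors in schedule.values():
        if not actors:
            continue
        # Delayless edge from each actor to its successor on the processor.
        for a, b in zip(actors, actors[1:]):
            ipc[(a, b)] = 0
        # Unit-delay edge from the last actor back to the first one.
        ipc[(actors[-1], actors[0])] = 1
    # Inter-processor edges keep the delay of the application graph edge.
    for (x, y), d in app_edges.items():
        if proc[x] != proc[y]:
            ipc[(x, y)] = d
    return ipc


# Hypothetical example: actors A and B on processor 1, actor C on processor 2.
app = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1}
sched = {1: ["A", "B"], 2: ["C"]}
for edge, d in sorted(build_ipc_graph(app, sched).items()):
    print(edge, d)
```

A dictionary keyed by vertex pairs cannot hold parallel edges; this is sufficient here because
sequencing edges join actors on the same processor while inter-processor edges never do.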

IPC costs (estimated transmission latencies through the multiprocessor network) can be
incorporated into the IPC graph model by explicitly including communication (send and receive)
actors, and setting the execution times of these actors to equal the associated IPC costs.

[Figure 1: application graph (image not recovered), shown with the following self-timed schedule.
Proc 1: (1, 2, 3, 4, 6); Proc 2: (5, 7, 8); Proc 3: (9).]

Figure 1. An example of an application graph and an associated self-timed schedule. The
numbers on the edges (6, 8) and (6, 9) denote nonzero delays.

The IPC graph is an instance of Reiter's computation graph model [20], also known as the
timed marked graph model in Petri net theory [19], and from the theory of such graphs, it is well
known that in the ideal case of unlimited bus bandwidth, the average iteration period for the
ASAP execution of an IPC graph is given by the maximum cycle mean (MCM) of $G_{ipc}$, which is
defined by

$$MCM(G_{ipc}) = \max_{\text{cycle } C \text{ in } G_{ipc}} \left\{ \frac{\sum_{v \in C} exec(v)}{Delay(C)} \right\}. \qquad (3)$$

The quotient in (3) is referred to as the cycle mean of the associated cycle $C$.
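
As an illustration of (3), the following Python sketch computes the MCM by enumerating every
simple cycle and taking the largest cycle mean. It is a brute-force sketch intended only for small
graphs (the number of simple cycles can grow exponentially) and is not one of the efficient
algorithms used in practice; the function name max_cycle_mean and the example graph are
hypothetical, and integer execution times and delays are assumed.

```python
from fractions import Fraction


def max_cycle_mean(exec_time, delay):
    """Maximum over simple cycles C of sum(exec(v) for v in C) / Delay(C).

    exec_time: dict actor -> integer execution time.
    delay:     dict (src, snk) -> non-negative integer delay.
    Returns a Fraction, or None if the graph has no cycle.
    """
    order = {v: i for i, v in enumerate(exec_time)}
    succ = {v: [] for v in exec_time}
    for (u, v), d in delay.items():
        succ[u].append((v, d))
    best = None

    def extend(root, u, on_path, t_sum, d_sum):
        nonlocal best
        for w, d in succ[u]:
            if w == root:
                total = d_sum + d
                if total == 0:
                    raise ValueError("zero-delay cycle: the graph deadlocks")
                mean = Fraction(t_sum, total)
                if best is None or mean > best:
                    best = mean
            elif order[w] > order[root] and w not in on_path:
                # Each cycle is explored only from its lowest-ranked vertex.
                on_path.add(w)
                extend(root, w, on_path, t_sum + exec_time[w], d_sum + d)
                on_path.discard(w)

    for root in exec_time:
        extend(root, root, {root}, exec_time[root], 0)
    return best


# Hypothetical graph with cycles A-B-A (mean 5/1) and A-B-C-A (mean 12/2).
times = {"A": 3, "B": 2, "C": 7}
delays = {("A", "B"): 0, ("B", "A"): 1, ("B", "C"): 0, ("C", "A"): 2}
print(max_cycle_mean(times, delays))
```

Exact rational arithmetic is used so that cycle means are compared without rounding error.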

A similar data structure that is useful in analyzing OT implementations is Sriram's ordered
transaction graph model [24]. Given an ordering $O = \{o_1, o_2, \ldots, o_p\}$ for the communication
actors in an IPC graph $G_{ipc} = (V_{ipc}, E_{ipc})$, the corresponding ordered transaction graph
$\Gamma(G_{ipc}, O)$ is defined as the directed graph $G_{OT} = (V_{OT}, E_{OT})$, where $V_{OT} = V_{ipc}$,
$E_{OT} = E_{ipc} \cup E_O$,

$$E_O = \{(o_1, o_2), (o_2, o_3), \ldots, (o_{p-1}, o_p), (o_p, o_1)\}, \qquad (4)$$

$delay(o_i, o_{i+1}) = 0$ for $1 \leq i < p$, and $delay(o_p, o_1) = 1$. Thus, an IPC graph can be modified
by adding edges obtained from the ordering $O$ to create the ordered transaction graph.
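
The edge set in (4) translates directly into code. The Python sketch below is illustrative: it
assumes the IPC graph is stored as a dictionary from edges to delays, and the function name
ordered_transaction_graph is hypothetical. Where an ordering edge coincides with an existing IPC
edge, the sketch keeps the smaller delay, which is an assumption made here because the smaller
delay is the more restrictive constraint.

```python
def ordered_transaction_graph(ipc_delay, order):
    """Add the ordering edges E_O of a transaction order to an IPC graph.

    ipc_delay: dict (src, snk) -> delay for the IPC graph.
    order:     list [o_1, ..., o_p] of communication actors.
    Returns a new dict (src, snk) -> delay for the ordered transaction graph.
    """
    ot = dict(ipc_delay)
    if not order:
        return ot

    def add(edge, d):
        # Assumption: parallel edges are merged by keeping the smaller delay.
        ot[edge] = min(ot.get(edge, d), d)

    for a, b in zip(order, order[1:]):
        add((a, b), 0)               # delay(o_i, o_{i+1}) = 0
    add((order[-1], order[0]), 1)    # delay(o_p, o_1) = 1
    return ot


# Hypothetical IPC graph with one send actor "s" and one receive actor "r".
ipc = {("a", "s"): 0, ("s", "a"): 1, ("r", "b"): 0, ("b", "r"): 1, ("s", "r"): 0}
for edge, d in sorted(ordered_transaction_graph(ipc, ["s", "r"]).items()):
    print(edge, d)
```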


2.  Previous  work 


In [23, 24], Sriram and Lee discuss some of the advantages and disadvantages of the OT
strategy compared to the ST strategy, in particular, lower synchronization and arbitration costs
for the IPC mechanism at the expense of some run-time flexibility. They also develop a method to
compute an optimum transaction order when a fully-static schedule is given beforehand. In this
approach, a set of inequalities is constructed using the timing information of the given FS
schedule, and represented as a graph. The Bellman-Ford shortest path algorithm is applied to this
graph to obtain new starting times of the actors, thereby modifying the original FS schedule. A
transaction order is then obtained by sorting the starting times of the communication actors. We
shall term this method of finding the transaction orders, which is an efficient polynomial-time
algorithm, the Bellman-Ford Based (BFB) method. Under an assumption that the cost (latency) of IPC
is zero, Sriram shows that the transaction order determined by the BFB technique is always
optimum.

However, in this paper, we show that when IPC costs are not negligible, as is frequently
and increasingly the case in practice, the problem of determining an optimal transaction order is
NP-hard. Thus, under nonzero IPC costs, we must resort to heuristics for efficient solutions.
Furthermore, the polynomial-time BFB algorithm is no longer optimal, and alternative techniques
that account for IPC costs are preferable.

Numerous approaches have been proposed for incorporating IPC costs into the assignment
and ordering steps of scheduling (e.g., [2, 22]). The techniques that we propose in this paper are
complementary to these approaches in that they provide a means for mapping the resulting
schedules into efficient OT implementations, which eliminate the performance and power consumption
overhead associated with run-time synchronization and contention resolution.

3.  Comparison  of  self-timed  and  ordered  transaction  strategies 

Suppose that we are given an application graph and an associated multiprocessor schedule,
together with an FS implementation, an OT implementation, and an ST implementation for the
schedule, and let $T_{FS}$, $T_{OT}$, and $T_{ST}$, respectively, denote the average iteration periods of the
corresponding schedules. In general, when IPC costs are negligible, $T_{FS} \geq T_{OT} \geq T_{ST}$ [24]. This
is because the ST method has the least constraints. The ST schedule only has assignment and
ordering constraints, while the OT schedule has transaction ordering constraints in addition to
those constraints, and the FS schedule has exact timing constraints that subsume the constraints in
the ST and OT schedules. ST schedules overlap in a natural manner, and eventually settle into a
periodic pattern of iterations. This pattern can be exponential in size, and therefore, the ST
schedule has the advantage that in successive iterations, the transaction order may be different,
while this flexibility is not available for the OT and FS schedules.

In practical cases, however, the IPC cost is non-zero. Depending on the bandwidth of the
bus, IPC costs may be quite significant. The throughput of the ST schedule can be computed
easily when IPC costs are ignored by calculating the MCM of the corresponding dataflow graph (i.e.,
via (3)). However, when IPC costs are taken into account, this can no longer be done since the
notion of bus contention comes into the picture. Not only do the communication actors in the
dataflow graph have to wait for sufficient tokens on the input arcs to fire, they also have to wait
for the bus to be available, i.e., no other communication actor should be accessing the bus at the
same instant of time. Therefore, the throughput of the self-timed schedule is typically derived
using simulation techniques, which are time-consuming. On the other hand, the throughput of the
OT schedule can still be obtained by calculating the MCM of the ordered transaction graph since
there will be no bus contention when a linear order is imposed on the communication actors [23].

The relation $T_{FS} \geq T_{OT} \geq T_{ST}$ is also no longer valid in the presence of non-zero IPC
costs. To see why this is true, assume that two communication actors become enabled (have
sufficient input tokens to fire) at more or less the same time. Then the ST method will schedule the
communication actor that becomes enabled earlier. Doing this may result in a lower throughput
since, for example, the processor that contains the communication actor that is scheduled later
might be more heavily loaded. The FS and the OT methods avoid such pitfalls by analyzing the
schedules at compile time, and producing an exact firing time assignment, or a transaction order
that takes the entire schedule into consideration. Intuitively, the ST method follows a more
greedy, ASAP approach in choosing which communication actor to schedule next, and this can
result in inefficient execution patterns.

Example 1: To illustrate how an ST schedule might perform worse than an OT schedule,
consider the IPC graph of Figure 2. Dashed edges represent inter-processor data dependencies.
Numbers beside actors show their execution times, numbers beside edges indicate nonzero delays,
$xs_y$ denotes the $y$th send actor of computation actor $x$, and $xr_y$ denotes the $y$th receive actor
of $x$. Figure 3 shows the periodic pattern that the ST schedule eventually settles down into.
Although Processor 1 is most heavily loaded, we see that there are instances when the processor is
idling waiting for the bus to become free. In contrast, when the transaction order
$(5s_1, 2r_1, 7r_1, 4s_1, 4s_2, 4s_3, 8r_1, 9r_1)$ is enforced (Figure 4), an 11% lower average iteration
period results. This is because the transaction order is computed in a fashion that enables the
heavily loaded Processor 1 to access the bus whenever it requires it. Such an ability to prioritize
strategically-selected transactions is especially important in heterogeneous multiprocessors,
which often have imbalanced loads due to large variations in processing capabilities of the
computing resources.

Figure 2. IPC graph constructed from application graph of Figure 1.

The ST approach has the further disadvantage that in the presence of execution time
uncertainties, there is no known method for computing a tight worst-case iteration period, even
using simulation techniques. In particular, the period of the ST schedule obtained by using
worst-case execution time estimates of the actors does not necessarily give us the worst-case iteration
period of a schedule. This can prove to be a big disadvantage in real-time systems where
worst-case bounds are needed beforehand.

Figure 3. Gantt chart for ST schedule in Example 1.

Figure 4. Gantt chart for OT schedule in Example 1.

Example 2: Consider the IPC graph of Figure 5, and suppose that Actor 1 has a worst-case
execution time of 21, and a best-case execution time of 19. Figure 6 shows the ST schedule that
results when Actor 1 has an execution time of 21. An iteration period of 50 is obtained. However,
when the same schedule is simulated for an execution time of 19, we obtain an iteration period of
59 as shown in Figure 7.

Figure 5. IPC graph for Example 2.

In contrast, the iteration period obtained by computing the MCM of the ordered
transaction graph with worst-case actor execution times is the worst-case iteration period. This is
because the MCM is an accurate measure of performance for ordered transaction implementations
[23, 24], and the MCM can only increase or remain the same when the execution time of an actor is
increased.

4.  Finding  optimal  transaction  orders 

In the transaction ordering problem, our objective is to determine a transaction order $O$
for a given IPC graph such that the MCM of the resulting ordered transaction graph is minimized
(so that throughput is maximized). As mentioned in Section 2, it has been shown that this problem
is tractable when IPC costs are ignored. In this section, we show that when IPC costs are
considered, the transaction ordering problem becomes NP-complete.

We show this by first showing that determining an optimal transaction order for
non-iterative implementations, which is a more restricted (easier) problem, is NP-complete. To
convert an iterative IPC graph to a non-iterative one, it suffices to remove all edges in the graph
that have delays of one or more. This results in an acyclic graph since any cycle in the original
graph must have a delay of one or more for the graph not to be deadlocked.

Figure 7. Gantt chart for ST schedule with exec(1) = 19.


Definition 1: Given an IPC graph $G_{ipc} = (V, E)$, the associated non-iterative inter-processor
communication (NIPC) graph is defined as $G_{nipc} = (V, E_{nipc})$, where
$E_{nipc} = \{e \mid e \in E \text{ and } delay(e) = 0\}$.

Definition 2: Given an NIPC graph $G_{nipc} = (V, E_{nipc})$, and an ordering $O$, the corresponding
non-iterative ordered transaction (NOT) graph $G_{NOT} = \Pi(G_{nipc}, O)$ is defined as
$G_{NOT} = (V_{NOT}, E_{NOT})$, where $V_{NOT} = V$, $E_{NOT} = (E_{nipc} \cup E_O) - \{(o_p, o_1)\}$, and $E_O$ is as
defined in (4).

By definition, the total execution time (makespan) of a NOT graph $G_{NOT}$ is finite, and
this execution time can be determined in polynomial time, as the length of the longest
cumulative-execution-time path in $G_{NOT}$, since $G_{NOT}$ is acyclic and the execution times of all
actors are nonnegative. However, given an IPC graph, finding a transaction order that minimizes the
makespan of the associated NOT graph is intractable.
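
The polynomial-time makespan computation can be sketched as a longest-path pass over a
topological order. The Python code below is a minimal illustration, assuming the NIPC graph is
given as a list of delayless edges and that actors are identified by strings; the function name
not_graph_makespan and the six-actor example are hypothetical. It relies on the standard-library
graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def not_graph_makespan(exec_time, nipc_edges, order):
    """Makespan of the NOT graph for a given transaction order.

    exec_time:  dict actor -> execution time.
    nipc_edges: iterable of delayless edges (src, snk) of the NIPC graph.
    order:      list [o_1, ..., o_p] of communication actors.
    """
    # E_NOT: the NIPC edges plus the chain o_1 -> o_2 -> ... -> o_p; the
    # wrap-around edge (o_p, o_1) is left out, as in Definition 2.
    edges = set(nipc_edges) | set(zip(order, order[1:]))
    preds = {v: set() for v in exec_time}
    for u, v in edges:
        preds[v].add(u)
    finish = {}
    # static_order() yields every actor after all of its predecessors and
    # raises graphlib.CycleError if the order contradicts the precedences.
    for v in TopologicalSorter(preds).static_order():
        start = max((finish[u] for u in preds[v]), default=0)
        finish[v] = start + exec_time[v]
    return max(finish.values(), default=0)


# Hypothetical two-processor instance: m_i -> u_i -> n_i on each processor.
times = {"m1": 2, "u1": 3, "n1": 1, "m2": 0, "u2": 2, "n2": 4}
chain = [("m1", "u1"), ("u1", "n1"), ("m2", "u2"), ("u2", "n2")]
print(not_graph_makespan(times, chain, ["u1", "u2"]))
print(not_graph_makespan(times, chain, ["u2", "u1"]))
```

In this small instance the two possible transaction orders give different makespans, which is the
effect that the transaction ordering problem seeks to exploit.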

Definition 3: The non-iterative transaction ordering problem is defined as follows. Given an
NIPC graph $G_{nipc} = (V, E_{nipc})$, and a positive integer $k$, does there exist a transaction order
$O = \{o_1, o_2, \ldots, o_p\}$ such that $G_{NOT} = \Pi(G_{nipc}, O)$ has a makespan that is less than or equal
to $k$?

To show that non-iterative transaction ordering is NP-hard, we derive a reduction from the
sequencing with release times and deadlines (SRTD) problem, which is known to be NP-complete
[8]. The SRTD problem is defined as follows.

Definition 4: (The SRTD problem). Given a set $T$ of tasks, and for each task $t \in T$, a
length (duration) $l(t) \in \mathbb{N}$, a release time $r(t) \in \mathbb{N}$, and a deadline
$d(t) \in \mathbb{N}$, is there a single-processor schedule for $T$ that satisfies the release time
constraints and meets all the deadlines? That is, is there a one-to-one function (called a valid
SRTD schedule) $\sigma : T \to \mathbb{N}$, with
$(\sigma(t) > \sigma(t')) \Rightarrow (\sigma(t) \geq \sigma(t') + l(t'))$, and for all $t \in T$,
$\sigma(t) \geq r(t)$ and $\sigma(t) + l(t) \leq d(t)$?
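
The three conditions of Definition 4 can be checked directly for a candidate schedule. The Python
sketch below is illustrative; the function name is_valid_srtd_schedule and the dictionary-based
representation are assumptions. The data are those of the SRTD instance used in Example 3 of this
section.

```python
def is_valid_srtd_schedule(sigma, length, release, deadline):
    """Check whether sigma (task -> start time) is a valid SRTD schedule.

    length, release, deadline: dicts task -> l(t), r(t), d(t).
    """
    tasks = list(length)
    # sigma must be one-to-one: no two tasks share a start time.
    if len({sigma[t] for t in tasks}) != len(tasks):
        return False
    # Release times and deadlines must be respected.
    for t in tasks:
        if sigma[t] < release[t] or sigma[t] + length[t] > deadline[t]:
            return False
    # Tasks must not overlap on the single processor; with tasks sorted by
    # start time it is enough to compare each task with its successor.
    by_start = sorted(tasks, key=lambda t: sigma[t])
    for a, b in zip(by_start, by_start[1:]):
        if sigma[b] < sigma[a] + length[a]:
            return False
    return True


# SRTD instance of Example 3.
length = {1: 5, 2: 2, 3: 3, 4: 1}
release = {1: 0, 2: 4, 3: 5, 4: 6}
deadline = {1: 5, 2: 8, 3: 11, 4: 8}
print(is_valid_srtd_schedule({1: 0, 2: 5, 3: 8, 4: 7}, length, release, deadline))
print(is_valid_srtd_schedule({1: 0, 2: 5, 3: 7, 4: 8}, length, release, deadline))
```

Verifying a given schedule takes polynomial time; the difficulty of SRTD lies in finding such a
schedule among all possible task sequences.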

Theorem  1:  The  non-iterative  transaction  ordering  problem  is  NP-complete. 

Proof:  This  problem  is  clearly  in  NP  since  we  can  verify  in  polynomial  time  whether  the  longest 
path  length  (in  terms  of  cumulative  execution  time)  of  the  graph  is  less  than  or  equal  to  a  given 
positive  integer. 

Now suppose that we are given an instance of the SRTD problem $(T, r, l, d)$ with
$T = \{t_1, t_2, \ldots, t_p\}$. We construct an NIPC graph $G_{nipc}$ from this instance by carrying out the
following steps. Here, all edges instantiated are delayless unless otherwise specified, and $k$ is
equal to the maximum deadline of the tasks in the given instance of the SRTD problem.

For each $t_i \in T$,

i) instantiate a send actor $u_i$ when $i$ is odd, or a receive actor $u_i$ when $i$ is even, with
$exec(u_i) = l(t_i)$ and $proc(u_i) = i$.

ii) instantiate a computation actor $m_i$ with $exec(m_i) = r(t_i)$ and $proc(m_i) = i$.

iii) instantiate a computation actor $n_i$ with $exec(n_i) = k - d(t_i)$ and $proc(n_i) = i$.

iv) instantiate an edge $(m_i, u_i)$ and another edge $(u_i, n_i)$.

Each send actor $u_i$ is connected to the receive actor $u_{i+1}$ by an interprocessor edge
$(u_i, u_{i+1})$ with a delay of unity. Since each of the interprocessor edges has a delay of unity,
these edges are not present in $G_{nipc}$. Without loss of generality, we assume that there is an even
number of tasks, so that the number of send and receive actors is the same (if the number of tasks is
not even to begin with, we can instantiate an appropriately-defined dummy actor to generate an
equivalent "even-task" instance). Observe from our construction that from the $p$ tasks in the
given instance of the SRTD problem, we construct a graph $G_{nipc}$ that involves $p$ processors, $p$
communication actors, $2p$ computation actors, and $2p$ edges.
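
Steps i) through iv) of the construction can be sketched as follows. This Python code is an
illustration under stated assumptions: tasks are passed as parallel lists indexed from $t_1$, actors
are named by the strings "u1", "m1", "n1", and so on, and the function name srtd_to_nipc and the
two-task instance are hypothetical. The unit-delay interprocessor edges are omitted because they do
not appear in $G_{nipc}$.

```python
def srtd_to_nipc(length, release, deadline):
    """Build the NIPC graph of the reduction from an SRTD instance.

    length, release, deadline: lists holding l(t_i), r(t_i), d(t_i).
    Returns (exec_time, proc, kind, edges, k).
    """
    k = max(deadline)  # k is the maximum deadline of the instance
    exec_time, proc, kind, edges = {}, {}, {}, []
    for i, (l, r, d) in enumerate(zip(length, release, deadline), start=1):
        u, m, n = f"u{i}", f"m{i}", f"n{i}"
        # i) communication actor: a send if i is odd, a receive if i is even
        exec_time[u], kind[u] = l, ("send" if i % 2 == 1 else "receive")
        # ii) computation actor that models the release time
        exec_time[m], kind[m] = r, "computation"
        # iii) computation actor that models the deadline
        exec_time[n], kind[n] = k - d, "computation"
        proc[u] = proc[m] = proc[n] = i
        # iv) delayless edges m_i -> u_i -> n_i
        edges += [(m, u), (u, n)]
    return exec_time, proc, kind, edges, k


# Hypothetical two-task instance.
exec_time, proc, kind, edges, k = srtd_to_nipc([2, 3], [0, 1], [4, 6])
print(k, exec_time)
print(edges)
```

For $p$ tasks the result contains $3p$ actors and $2p$ edges, in agreement with the counts stated
above.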

Claim: If there exists a transaction order $O$ such that $G_{NOT} = \Pi(G_{nipc}, O)$ has a makespan
that is less than or equal to $k$, then there exists a valid SRTD schedule for the given instance of
the SRTD problem.

The reasoning behind our construction and the above claim is that we make the
communication actors of the ordered transaction graph correspond exactly to the tasks of the SRTD
problem. We do this by making the execution time of the computation actor before each corresponding
communication actor equal to the release time of the associated task and, thus, guarantee that the
communication actors cannot begin execution before their respective release times. Also, since
computation actors will begin execution from time 0 as each is on a different processor, the
release times correspond to when they complete execution. Similarly, the execution times of the
computation actors that follow the communication actors are chosen to be $k - d(t_i)$ so that the
corresponding communication actors will have to complete their execution before $d(t_i)$ for the
makespan to be less than or equal to $k$. This is true because the computation actor can begin
execution immediately after the communication actor has finished. Therefore, the valid SRTD
schedule corresponds exactly to the shared bus schedule in the derived instance of the non-iterative
transaction ordering problem. If we can find a transaction order that has a makespan less than or
equal to $k$, we have a bus schedule that schedules the communication actors in the same manner
as an appropriate single-processor schedule for the corresponding SRTD tasks. Conversely, if a
transaction order cannot be found that satisfies the given makespan constraint, it is easily seen that
there is no valid SRTD schedule for the given instance of the SRTD problem. Q.E.D.

Note that in Theorem 1, we have simplified the problem greatly by assuming the
inter-processor edges to have unit delays. This removes the inter-dependencies that are imposed by
these edges, but even with this simplification, the problem remains NP-complete.

Example 3: Suppose that we are given an instance of the SRTD problem with task set
$T = \{t_1, t_2, t_3, t_4\}$, and respective release times $r(T) = \{0, 4, 5, 6\}$, lengths
$l(T) = \{5, 2, 3, 1\}$, and deadlines $d(T) = \{5, 8, 11, 8\}$. To construct an instance of the
non-iterative transaction ordering problem with $k = 11$, we create 4 processors, each with 3 vertices.
The execution times are determined as above; e.g., $exec(u_1) = 5$, $exec(m_1) = 0$, and
$exec(n_1) = k - d(t_1) = 6$. The resulting NOT graph is illustrated in Figure 8. Dash-dot edges
indicate OT edges. Removing the dash-dot edges that represent the transaction order gives us the
NIPC graph constructed as above. This figure shows a transaction order $(u_1, u_2, u_4, u_3)$ under
which the schedule length of 11 is satisfied. This means that there exists a valid SRTD schedule for
the given SRTD problem instance. The start times of the tasks can be obtained by finding the longest
path lengths between the source nodes and the corresponding communication actors. Setting the
starting times of the tasks $(t_1, t_2, t_3, t_4)$ to equal $(0, 5, 8, 7)$, respectively, we obtain a
valid SRTD schedule for the SRTD problem instance.
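
The SRTD instance of Example 3 is small enough to check by exhaustive search. The following is a minimal Python sketch, assuming non-preemptive execution on one processor; the function name `srtd_schedule` and the list-based data layout are illustrative and not part of the original formulation. It tries every task ordering, starts each task as early as its release time allows, and returns the start times of the first ordering that meets all deadlines.

```python
from itertools import permutations

def srtd_schedule(release, length, deadline):
    """Return start times of a feasible one-processor schedule, or None."""
    n = len(release)
    for order in permutations(range(n)):
        t, starts = 0, [0] * n
        for i in order:
            t = max(t, release[i])   # a task cannot start before its release time
            starts[i] = t
            t += length[i]
            if t > deadline[i]:      # deadline missed: abandon this ordering
                break
        else:
            return starts            # every task met its deadline
    return None

# Instance from Example 3
print(srtd_schedule([0, 4, 5, 6], [5, 2, 3, 1], [5, 8, 11, 8]))
```

For the instance above, the search returns the start times (0, 5, 8, 7) quoted in the example. The exhaustive enumeration takes factorial time, which is in keeping with the intractability of the problem.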
Figure 8. NOT graph constructed in Example 3.

As demonstrated by Theorem 2 below, we can extend the proof of Theorem 1 to show
that the transaction ordering problem is NP-complete in the iterative context as well as in the
non-iterative case.

Definition 5: The iterative transaction ordering problem (also called the transaction ordering
problem) is defined as follows. Given an IPC graph $G_{ipc}$ and a positive integer $k$, does there
exist a transaction order $O$ such that $G_{OT} = T(G_{ipc}, O)$ satisfies $MCM(G_{OT}) \le k$?
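
The decision question in Definition 5 hinges on evaluating the MCM. The sketch below assumes, consistent with the quotient referred to as (3), that the mean of a cycle is the total execution time of its actors divided by the total delay on its edges. The function name and data layout are illustrative. It enumerates simple cycles by depth-first search, so it is only suitable for small graphs, unlike the polynomial-time algorithms referred to in the text.

```python
def max_cycle_mean(exec_time, edges):
    """exec_time: dict actor -> execution time; edges: list of (src, dst, delay)."""
    succ = {}
    for u, v, d in edges:
        succ.setdefault(u, []).append((v, d))
    rank = {v: i for i, v in enumerate(exec_time)}
    best = 0.0

    def dfs(root, v, time, delay, on_path):
        nonlocal best
        for w, d in succ.get(v, []):
            if w == root:                       # closed a cycle back to the root
                total = delay + d
                if total == 0:
                    raise ValueError("delay-free cycle: the graph is deadlocked")
                best = max(best, time / total)
            elif w not in on_path and rank[w] > rank[root]:
                # only visit higher-ranked actors so each cycle is found once
                on_path.add(w)
                dfs(root, w, time + exec_time[w], delay + d, on_path)
                on_path.remove(w)

    for root in exec_time:
        dfs(root, root, exec_time[root], 0, {root})
    return best

times = {"A": 2, "B": 3, "C": 4}
graph = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "A", 2)]
print(max_cycle_mean(times, graph))   # cycle A-B has mean 5/1, cycle A-B-C has mean 9/2
```

With such a routine, an instance is a "yes" instance when some transaction order yields an OT graph whose MCM does not exceed k.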

Theorem  2:  The  iterative  transaction  ordering  problem  is  NP-complete. 

Proof: Given a transaction order, the MCM of the resulting OT graph can be found in polynomial
time; therefore, the problem is in NP.

To establish NP-hardness, we again derive a reduction from the SRTD problem, and we
modify the graph construction from the proof of Theorem 1 so that the MCM equals the
makespan.

Now suppose we are given an instance of the SRTD problem $(T, r, l, d)$ with
$T = \{t_1, t_2, \ldots, t_p\}$. We construct an IPC graph $G_{ipc}$ from this instance by carrying
out the following steps. All edges instantiated are delayless unless otherwise specified, and $k$ is
equal to the maximum deadline of the tasks in the given instance of the SRTD problem.

For each $t_i \in T$,

i) instantiate a send actor $u_i$ when $i$ is odd, or a receive actor $u_i$ when $i$ is even, with
$exec(u_i) = l(t_i)$ and $proc(u_i) = i$.

ii) instantiate a computation actor $m_i$ with $exec(m_i) = r(t_i)$ and $proc(m_i) = i$.

iii) instantiate a computation actor $n_i$ with $exec(n_i) = k - d(t_i)$ and $proc(n_i) = i$.

iv) instantiate an edge $(m_i, u_i)$ and another edge $(u_i, n_i)$.

v) instantiate a send actor $s_i$ with $exec(s_i) = 0$ and $proc(s_i) = i$.

vi) instantiate a receive actor $r_i$ with $exec(r_i) = 0$ and $proc(r_i) = i$.

vii) instantiate a computation actor $d_i$ with $exec(d_i) = 0$ and $proc(d_i) = i$.

viii) instantiate an edge $(n_i, s_i)$, an edge $(s_i, r_i)$, and another edge $(r_i, d_i)$.

ix) instantiate another receive actor $q_i$ with $exec(q_i) = 0$ and $proc(q_i) = p + 1$ (recall
that $p = |T|$).

x) instantiate another send actor $w_i$ with $exec(w_i) = 0$ and $proc(w_i) = p + 1$.

xi) instantiate an (interprocessor) edge $(s_i, q_i)$ and another edge $(w_i, r_i)$.

After completing all of the above, join all $q_i$'s with edges in a linear chain, instantiate a
computation actor $h$ with $exec(h) = 0$ and $proc(h) = p + 1$, instantiate edges $(q_p, h)$ and
$(h, w_1)$, and again join all $w_i$'s with edges in a linear chain. Finally, for each of the $p + 1$
processors, add an edge with a delay of unity from the last actor on the processor to the first actor.

We again assume without loss of generality that there is an even number of tasks in $T$.
Each send actor $u_i$ is connected to the receive actor $u_{i+1}$ with an interprocessor edge of unit
delay. Note that in the OT graph $T(G_{ipc}, O)$, these interprocessor edges become redundant (in the
sense of synchronization redundancy, as discussed in [3]) because of the ordered transaction
edges added due to $O$: since the ordered transaction edges are connected by a cycle of delay
unity, the constraints imposed by $\{(u_i, u_{i+1})\}$ are automatically met by the ordered transaction
edges.
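
The construction above can be sketched in Python as follows. This is an illustrative reading of steps i) through xi) and the closing remarks, not a definitive implementation: the function name `build_ipc_graph`, the actor naming scheme (`u1`, `m1`, ...), and the tuple layouts are assumptions, and the send/receive distinction between actors is not recorded.

```python
def build_ipc_graph(release, length, deadline):
    """Return (actors, edges, k): actors maps name -> (exec, proc),
    edges is a list of (src, dst, delay)."""
    p = len(release)
    assert p % 2 == 0, "the proof assumes an even number of tasks"
    k = max(deadline)
    actors, edges = {}, []
    for i in range(1, p + 1):
        u, m, n, s, r, d, q, w = (f"{c}{i}" for c in "umnsrdqw")
        actors[u] = (length[i - 1], i)            # step i
        actors[m] = (release[i - 1], i)           # step ii
        actors[n] = (k - deadline[i - 1], i)      # step iii
        actors[s] = actors[r] = actors[d] = (0, i)        # steps v-vii
        actors[q] = actors[w] = (0, p + 1)                # steps ix-x
        edges += [(m, u, 0), (u, n, 0),                   # step iv
                  (n, s, 0), (s, r, 0), (r, d, 0),        # step viii
                  (s, q, 0), (w, r, 0),                   # step xi
                  (d, m, 1)]      # unit-delay edge, last to first actor on processor i
    actors["h"] = (0, p + 1)
    chain = ([f"q{i}" for i in range(1, p + 1)] + ["h"]
             + [f"w{i}" for i in range(1, p + 1)])
    edges += [(a, b, 0) for a, b in zip(chain, chain[1:])]   # q chain, h, w chain
    edges.append((chain[-1], chain[0], 1))                   # feedback on processor p + 1
    # unit-delay interprocessor edges from each send u_i (i odd) to receive u_{i+1}
    edges += [(f"u{i}", f"u{i + 1}", 1) for i in range(1, p, 2)]
    return actors, edges, k

actors, edges, k = build_ipc_graph([0, 4, 5, 6], [5, 2, 3, 1], [5, 8, 11, 8])
print(k, actors["u1"], actors["m1"], actors["n1"])
```

For the four-task instance used earlier, this yields k = 11 and gives the actors on processor 1 the execution times 5, 0, and 6.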

This graph effectively represents a blocked schedule for an iterative graph when the actors
instantiated from step v) onward have execution times that are much smaller than those of the
other actors, and the MCM of the constructed graph represents the longest path, or the schedule
length, of the graph. Note that each of the longest paths in the non-iterative graph will correspond
to a cycle in the iterative case, where the cycle mean of the cycle is equal to the length of the
longest path (since the denominator of the associated quotient in (3) is unity). As in the
non-iterative case, it is possible to find a one-processor schedule of the SRTD instance that
satisfies the constraints if we can determine a transaction order whose enforcement will guarantee
that the MCM of the corresponding OT graph is less than or equal to k. This is true because the
communication actors that have non-zero IPC cost in the bus schedule of the OT problem
correspond to the tasks in the valid schedule of the SRTD problem.

Hence,  we  can  conclude  that  the  (iterative)  transaction  ordering  problem  is  NP-complete. 

Q.E.D. 


Figure 9. Constructed OT graph in Example 4.

Example 4: Consider again the SRTD instance of Example 3. Figure 9 shows the corresponding
ordered transaction graph that results when the ordering $(u_1, u_2, u_4, u_3)$ is imposed. Removing
the OT edges $(u_1, u_2)$, $(u_2, u_4)$, $(u_4, u_3)$, $(u_3, u_1)$ gives the constructed IPC graph.
Note that the edges $(u_1, u_2)$ and $(u_3, u_4)$ introduced during construction are redundant in the
OT graph due to the paths $((u_1, u_2))$ and $((u_3, u_1), (u_1, u_2), (u_2, u_4))$, respectively,
that are imposed by the linear order and have delays of one or less.

5.  The  transaction  partial  order  heuristic 

The BFB technique does not take bus contention into consideration while scheduling the
transaction order. Instead, it tries to find a transaction order that will be close to or equal to the
associated self-timed schedule. However, we have demonstrated that in the presence of non-zero
IPC, the OT method can, in fact, perform significantly better than the ST method, and thus, more
direct consideration of OT execution is clearly worthwhile when scheduling transactions. For this
purpose, we propose in this section a heuristic, called the transaction partial order (TPO)
algorithm, that simultaneously takes IPC costs and the serialization effects of transaction ordering
into account when determining the transaction order. Note that OT edges added to the IPC graph can
only increase the MCM of the IPC graph, or leave the MCM unchanged. The MCM of the original
IPC graph therefore represents a lower bound on the achievable average iteration period. By
adding OT edges, we are effectively removing bus contention by making sure that no two
communication actors submit conflicting bus requests, and this generally increases the MCM of the
IPC graph. The TPO heuristic finds a transaction order on the basis that an OT edge that increases
the MCM of the IPC graph by a comparatively smaller amount should be given preference.
Therefore, to determine which communication actor should be scheduled first, we insert OT edges
between communication actors that are contending for the bus (during the transaction ordering
process), and calculate the corresponding MCM of the IPC graph. Actors whose corresponding
MCMs are more favorable under such an evaluation are scheduled earlier in the transaction order.

More specifically, a partial order of the communication (send and receive) actors is first
computed from the IPC graph $G_{ipc}$: the transaction partial order (TPO) graph $G_{TPO}$ is
computed by first deleting all edges in $G_{ipc}$ that have nonzero delays, and then deleting all of
the computation actors.

Example 5: The transaction partial order graph computed from the IPC graph of Figure 2 is
illustrated in Figure 10. Notice that all the dependencies imposed by the IPC graph are retained in
$G_{TPO}$, but only for the communication actors.
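
As a rough illustration, the TPO graph computation could be sketched in Python as below. One detail is an assumption on our part: since Example 5 states that all dependencies among communication actors are retained, the sketch bridges each deleted computation actor by connecting a communication actor to the nearest communication actors reachable through delayless edges. The function name and data layout are hypothetical.

```python
def tpo_graph(comm_actors, edges):
    """comm_actors: set of send/receive actor names.
    edges: list of (src, dst, delay) from the IPC graph.
    Returns the set of (src, dst) edges of the TPO graph."""
    succ = {}
    for u, v, d in edges:
        if d == 0:                          # edges with nonzero delay are deleted
            succ.setdefault(u, []).append(v)
    tpo = set()
    for a in comm_actors:
        stack, seen = list(succ.get(a, [])), set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in comm_actors:
                tpo.add((a, v))             # nearest communication successor
            else:
                stack.extend(succ.get(v, []))   # bridge across a computation actor
    return tpo

ipc = [("s1", "c", 0), ("c", "r2", 0), ("r2", "s1", 1), ("s1", "r1", 0)]
print(sorted(tpo_graph({"s1", "r1", "r2"}, ipc)))
```

In this toy graph the dependency of `r2` on `s1` passes through the computation actor `c`, and it survives as a direct edge in the TPO graph, while the unit-delay edge back to `s1` is dropped.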

The heuristic proceeds by considering, one by one, each vertex of $G_{TPO}$ that has no
input edges (vertices in the TPO graph that have no input edges are called ready vertices) as a
candidate to be scheduled next in the transaction order. Interprocessor edges are drawn from each
candidate vertex to all other ready vertices in $G_{ipc}$, and the corresponding MCM is measured.
The candidate whose corresponding MCM is the least when evaluated in this fashion is chosen as the
next vertex in the ordered transaction, and deleted from $G_{TPO}$. This process is repeated until
all communication actors have been scheduled into a linear ordering. A pseudocode specification of
the TPO heuristic is given in Figures 11-13.

Figure 10. TPO Graph in Example 5.

The algorithm makes sense intuitively since the dependencies imposed by the edges
drawn from the candidate vertices will remain when the transaction ordering $O$ is enforced.
These edges represent constraints in addition to the interprocessor edges that are already present
in $G_{ipc}$, and thus, they can only increase the MCM or leave the MCM unchanged. Since we are
interested in minimizing the MCM, we choose candidate vertices that increase the MCM by the
least possible amounts. Thus, the algorithm follows a greedy strategy in choosing vertices, but it
explicitly takes communication serialization and IPC costs into account.


Function Choose-Communication-Actor
Input: an IPC graph G = (V, E), a TPO graph G_TPO, and a list of actors ReadyList
Output: a communication actor v

For x ∈ ReadyList
    temp = ∅
    For y ∈ ReadyList
        If x ≠ y
            e = G.addedge(x, y)
            temp.add(e)
        end if
    end for
    criteria[x] = MCM(G)
    For e ∈ temp
        G.delete(e)
    end for
end for
v = the x ∈ ReadyList that minimizes criteria[x]
return v

Figure 11. Function to choose the next communication actor in the transaction order.

Example 6: When we apply the TPO heuristic to the IPC graph of Figure 2, the schedule we
obtain is illustrated by the Gantt chart of Figure 4. The corresponding OT graph is illustrated in
Figure 14.

The OT edges corresponding to the actors that have already been scheduled are added as
the heuristic proceeds since they represent the schedule of the bus, and hence, make the heuristic
more accurate for the later stages of the transaction order. The maximum number of nodes in the
ready list at any given instant is $P$ (where $P$ is the number of processors). The complexity of the
algorithm is thus $O(P |V|^2 |E_{OT}|)$ since the complexity of computing the MCM of a graph
$(V, E)$ is $O(|V| |E|)$.
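
The greedy loop described above can be sketched compactly in Python. This is an illustrative version under stated assumptions: the MCM routine is passed in as a callable `mcm(edges)` rather than implemented here, trial and committed OT edges are modeled as delayless `(src, dst, 0)` tuples, and the closing unit-delay edge from the last actor to the first is not added. The function name `tpo_order` and the argument layout are hypothetical.

```python
def tpo_order(comm_actors, tpo_edges, ipc_edges, mcm):
    """Greedy transaction ordering.
    tpo_edges: (src, dst) pairs of the TPO graph.
    ipc_edges: (src, dst, delay) triples of the IPC graph.
    mcm: callable returning the maximum cycle mean of an edge list."""
    preds = {a: set() for a in comm_actors}
    succs = {a: set() for a in comm_actors}
    for u, v in tpo_edges:
        preds[v].add(u)
        succs[u].add(v)
    edges = list(ipc_edges)
    ready = sorted(a for a in comm_actors if not preds[a])
    order = []
    while ready:
        def cost(x):
            # trial edges from candidate x to every other ready actor
            return mcm(edges + [(x, y, 0) for y in ready if y != x])
        v = min(ready, key=cost)
        ready.remove(v)
        if order:
            edges.append((order[-1], v, 0))   # commit the OT edge for later estimates
        order.append(v)
        for u in sorted(succs[v]):            # release successors whose preds are all done
            preds[u].discard(v)
            if not preds[u]:
                ready.append(u)
    return order

# Toy cost standing in for the MCM: count edges that leave actor "a".
toy = lambda es: sum(1 for u, _, _ in es if u == "a")
print(tpo_order({"a", "b", "c"}, {("a", "c")}, [], toy))
```

Each pass evaluates the cost function once per ready actor, and there are at most P ready actors at a time, which mirrors the factor of P in the complexity estimate.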

The edge of the transaction order that connects the last communication actor in the ordering
with the first one has a delay of unity (to represent the transition to the next graph iteration).
We can improve the performance of the TPO algorithm by introducing this edge at the beginning
because it will give a more accurate estimate of the MCM in choosing vertices later as the
heuristic proceeds. Under this modification, the heuristic proceeds as before, except that the "last"
(unit-delay) transaction ordering edge is drawn at the beginning. Since $G_{TPO}$ has a maximum of
$P$ communication actors that can be scheduled last in the transaction order, the modified heuristic
is repeated for each of these candidate communication actors, and the best solution that results is
selected. This increases the complexity of the algorithm by a factor of $P$ to
$O(P^2 |V|^2 |E_{OT}|)$.


Function Initialize
Input: an IPC graph G

compute G_TPO from G
For v ∈ G_TPO
    mark[v] = FALSE
    If indegree(v) == 0
        ReadyList.append(v)
    end if
end for


Figure 12. Function to initialize data structures called by the TPO function.


Function TPO-heuristic
Input: an IPC graph G = (V, E)
Output: a linear list of communication actors LinearList

Initialize(G)
complete = FALSE
first = TRUE
while not (complete)
    v = Choose-Communication-Actor(G, G_TPO, ReadyList)
    ReadyList.remove(v)
    mark[v] = TRUE
    LinearList.append(v)
    If not (first)
        G.addedge(w, v)
    end if
    w = v
    first = FALSE
    For u ∈ {u | (v, u) ∈ E_TPO}
        flag = TRUE
        For s ∈ {s | (s, u) ∈ E_TPO}
            If mark[s] == FALSE
                flag = FALSE
            end if
        end for
        If flag == TRUE
            ReadyList.append(u)
        end if
    end for
    If ReadyList.empty() == TRUE
        complete = TRUE
    end if
end while
return LinearList
end Function


Figure 13. Pseudocode for TPO heuristic.

6. Genetic algorithm for transaction scheduling


Since the transaction ordering problem is intractable, we are unable to efficiently find
optimal transaction orders on a consistent basis. We have implemented a branch and bound
strategy to explore the search space comprehensively, but this technique requires excessive amounts
of time for graphs that have significant numbers of IPC edges. To develop an alternative to this
branch and bound approach, and the TPO heuristic, we have implemented a genetic algorithm
(GA) to search for the best transaction order. The GA exploits the increased tolerance for compile
time that is available for many embedded applications [14], and can leverage the TPO heuristic by
incorporating its solution in the "initial population."

Figure 14. OT graph obtained by applying TPO heuristic in Example 6.

In our GA formulation, candidate transaction orders are encoded using the matrix-based
sequence-encoding method described in [7]. Using this method, the partial order of the
communication actors is converted into a precedence matrix and randomly completed to yield a random
transaction order that is valid. Mutation is carried out by swapping rows and columns, and
recombination is performed using the intersection operator explained in [7]. The intersection operator
takes subsequences that are common among the parents by taking the boolean "and" of the two
parent matrices to form the "offspring," and the undefined part is randomly completed.
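
To make the encoding concrete, here is a small Python sketch of a precedence matrix, the intersection operator, and random completion. It is an illustration only and not the method of [7] in detail: the nested-dictionary representation, the function names, and the use of a random topological sort for completion are assumptions made for this example.

```python
import random

def to_matrix(order, actors):
    """M[a][b] is True when actor a precedes actor b in the given order."""
    pos = {a: i for i, a in enumerate(order)}
    return {a: {b: pos[a] < pos[b] for b in actors} for a in actors}

def intersect(m1, m2):
    """Boolean AND: keep only the precedences shared by both parents."""
    return {a: {b: m1[a][b] and m2[a][b] for b in m1} for a in m1}

def complete(matrix, rng=random):
    """Randomly extend a partial precedence matrix to a linear order."""
    remaining = set(matrix)
    order = []
    while remaining:
        # actors with no unscheduled predecessor may be placed next
        ready = sorted(a for a in remaining
                       if not any(matrix[b][a] for b in remaining))
        pick = rng.choice(ready)
        order.append(pick)
        remaining.remove(pick)
    return order

actors = ["a", "b", "c", "d"]
parent1 = to_matrix(["a", "b", "c", "d"], actors)
parent2 = to_matrix(["a", "c", "b", "d"], actors)
child = complete(intersect(parent1, parent2), random.Random(0))
print(child)   # "a" stays first and "d" stays last; "b" and "c" may appear in either order
```

Because both parents already respect the transaction partial order, every precedence required by that partial order appears in both matrices, so it survives the boolean "and" and the offspring remains a valid transaction order.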

A pseudocode sketch of the GA is shown in Figure 15. For details on the underlying GA
concepts (e.g., tournament selection), we refer the reader to [1]. The mutation step takes
$O(|V|^2)$ time multiplied by the number of swaps carried out, since each time we have to check
whether the swap was valid by comparing it with the partial boolean matrix $M_{TPO}$ corresponding
to the transaction partial order graph $G_{TPO}$. The recombination step takes $O(|V|^2)$ time, and
the evaluation step takes $O(|V| |E_{OT}|)$ time. The overall complexity of each iteration is also
influenced by the population size and the overhead involved in generating random numbers.


Function TransOrderingGA
Input: an IPC graph G = (V, E)
Output: a linear list of communication actors LinearList

compute G_TPO from G
convert G_TPO to boolean matrix M_TPO
generate initial population M by randomly completing M_TPO
For j = 1 to NoIterations
    For i = 1 to PopulationSize
        P_i = mutate(M_i)
        R_i = recombine(P_{i-1}, P_i)
        R_i.FitnessValue = evaluate(R_i, G)
    end for
    M = tournament_selection(R, M)
end for
LinearList = transaction order encoded by the fittest member of M
return LinearList

Figure 15. Pseudocode for our GA approach to transaction ordering.

7.  Dynamic  reordering 

Once  we  obtain  a  transaction  order  (e.g.,  using  the  TPO  heuristic  or  the  GA  approach 
defined  in  Section  6),  it  is  possible  to  swap  the  position  of  consecutive  communication  actors  in 
the  transaction  order  as  long  as  the  new  positions  do  not  violate  the  dependencies  imposed  by  the 
transaction  partial  order.  This  method  has  the  advantage  that  it  cannot  degrade  the  transaction 
order since we can discard any solution that is worse. The concept is similar to dynamic variable
reordering used in OBDDs (Ordered Binary Decision Diagrams) [17]. We have implemented an
adaptation  to  ordered  transaction  scheduling,  called  dynamic  transaction  reordering  ( DTR ),  of  the 
Sifting  Algorithm  introduced  by  Rudell  [21],  and  have  observed  that  from  DTR,  we  consistently 
obtain  improvements  in  the  iteration  period,  regardless  of  the  method  used  to  find  the  transaction 
order. 
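
The idea can be illustrated with a short Python sketch in the spirit of sifting. It is not the implementation evaluated in this paper: the function name `sift`, the representation of the partial order as a set of pairs, and the generic `cost` callable (standing in for the iteration period of the resulting OT graph) are assumptions. Each actor is tried at every position that respects the partial order, which is exactly the set of positions reachable by valid swaps of consecutive actors, and a move is kept only if it strictly lowers the cost.

```python
def sift(order, precedes, cost):
    """order: list of communication actors.
    precedes: set of (a, b) pairs meaning a must come before b.
    cost: callable mapping an order to a number (lower is better).
    Returns (best_order, best_cost)."""
    def valid(seq):
        pos = {a: i for i, a in enumerate(seq)}
        return all(pos[a] < pos[b] for a, b in precedes)

    best, best_cost = list(order), cost(order)
    for actor in list(order):
        base = [a for a in best if a != actor]      # current best order without the actor
        for i in range(len(base) + 1):
            trial = base[:i] + [actor] + base[i:]   # reinsert the actor at position i
            if valid(trial):
                c = cost(trial)
                if c < best_cost:                   # keep strict improvements only
                    best, best_cost = trial, c
    return best, best_cost

toy_cost = lambda seq: 10 * seq.index("b") + seq.index("c")
print(sift(["a", "b", "c"], {("a", "c")}, toy_cost))
```

Since the starting order is retained unless a strictly better one is found, the result is never worse than the input, which matches the non-degradation property noted above.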

8.  Results 

Experiments were carried out to compare the ST method and the OT method, and to measure
the performance of the TPO, GA, and DTR heuristics in finding transaction orders. These
heuristics were implemented in C/C++ using the LEDA [16] framework for fundamental
graph-theoretic data structures and algorithms. The benchmarks are standard DSP applications that
have been scheduled using the classic HLFET algorithm [8] with straightforward extensions to
incorporate IPC costs.

The IPC graphs are fairly complicated, ranging from 50 to 150 nodes, and the numbers
of processors involved range from 2 to 8. The examples fft1, fft2, and fft3 result from three
representative schedules for Fast Fourier Transforms based on examples given in [15]; karp10 is a
music synthesis application based on the Karplus-Strong algorithm in 10 voices; and qmf4 is a
4-channel multi-resolution QMF filter bank for signal compression.

In the simulation of the ST schedule, we ignore the overhead of synchronization so as to
give us a worst-case comparison with the OT schedule. In practice, of course, synchronization has
nonzero cost, and thus, depending on the actual synchronization overhead in the target
architecture, the benefit of the OT schedules examined will be even greater than what the results
here demonstrate. Thus, our analysis in this section gives a lower bound on the improvement we can
expect using the OT implementation strategy in conjunction with our proposed transaction ordering
techniques.

Table 1 compares the performance (iteration period) of the ST and the OT schedules.
Here, the average iteration period ($T_{OT}$) of the OT schedule is obtained by taking the best
performance using the algorithms proposed in Sections 5-7, and $T_{ST}$ denotes the average
iteration period of the corresponding ST schedule. In each of the cases, we see that the OT strategy
can outperform the ST strategy, and that this holds even though we are ignoring synchronization
costs, which gives us a very optimistic view of the performance under ST execution.


Table 1. Comparison of ST and OT schedules.

Application    $T_{ST}$    $T_{OT}$
fft1           263         245
fft2           312         300
fft3           263         245
karp10         312         308
qmf4           147         140

Table 2 gives us a comparison between the different heuristics in finding transaction
orders. Each entry is the iteration period when the transaction order found by the heuristic is
enforced. Column 2 shows the iteration period when a randomly-generated transaction order is
enforced. From the table, we can conclude that all the heuristics work fairly well compared to the
random transaction order. The TPO heuristic for which the results are shown is the modified
version that inserts the unit-delay edge beforehand. This consistently gives us a slight improvement.
Generally, the TPO heuristic works better than the BFB technique (especially for fft1 and fft3),
and the heuristic that combines the TPO heuristic and DTR performs best (even better than the
GA, which takes significantly more time to execute). The GA was implemented with a population
size of 100, and the number of iterations was set to 1000. For the experiments that we tried, the GA
generally stabilized before the 1000-iteration limit was reached.


Table 2. Comparison of algorithms.

Application    $T_{random}$    $T_{BFB}$    $T_{TPO}$    $T_{GA}$    $T_{TPO+DTR}$
fft1           392             280          245          255         245
fft2           395             340          320          300         300
fft3           390             300          255          255         245
karp10         482             312          309          308         309
qmf4           196             148          145          140         145

When  we  use  the  transaction  ordering  obtained  by  the  TPO  heuristic  combined  with  DTR 
in  the  initial  population  of  the  GA,  we  achieve  the  best  results  since  we  simultaneously  obtain  the 
benefits  of  all  three  approaches.  The  results  are  shown  in  Table  3. 




Table 3. Results when the GA is applied to the TPO heuristic in conjunction
with DTR.

Application    $T_{TPO+DTR+GA}$
fft1           245
fft2           295
fft3           245
karp10         305
qmf4           140

9.  Conclusions 


We have demonstrated that in the presence of accurate estimates for actor execution times,
the ordered transaction method, which is superior to the self-timed method in its predictability
and its total elimination of synchronization overhead, can significantly outperform self-timed
implementation, even though ordered transaction implementation offers less run-time flexibility
due to a fixed ordering of communication operations. We have also shown that in the presence of
non-zero IPC costs, finding an optimal transaction order is an NP-complete problem, and we have
developed a variety of heuristic techniques to find efficient transaction orders. These techniques
include a low-complexity, deterministic heuristic for rapid design space exploration, and a genetic
algorithm for exploiting extra compile time when generating final implementations. Useful
directions for further work include integrating transaction ordering considerations into the
scheduling process, and the exploration of hybrid scheduling strategies that can combine ordered
transaction, self-timed, and fully-static strategies in the same implementation based on subsystem
characteristics.